Object Recognition and Classification

At this point, you should have a basic understanding of TensorFlow and its best practices. We'll follow these practices while we build a model capable of object recognition and classification. Building this model expands on the fundamentals covered so far while adding the terms, techniques and concepts of computer vision. The technique used to train the model has recently grown in popularity due to its accuracy across computer vision challenges.

ImageNet, a database of labeled images, is where computer vision and deep learning saw their recent rise in popularity. Annually, ImageNet hosts a challenge (ILSVRC) where people build systems capable of automatically classifying and detecting objects based on ImageNet's database of images. In 2012, the challenge saw a team named SuperVision submit a solution using a creative neural network architecture. ILSVRC solutions are often creative, but what set SuperVision's entry apart was its ability to accurately classify images. SuperVision's entry set a new standard for computer vision accuracy and stirred up interest in a deep learning technique named convolutional neural networks.

Convolutional neural networks (CNNs) have continued to grow in popularity. They're primarily used for computer vision related tasks but are not limited to working with images. CNNs can be used with any data which can be represented as a tensor where values are ordered next to related values (in a grid). Microsoft Research released a paper in 2014 in which CNNs were used for speech recognition, with an input tensor of a single-row grid of sound frequencies ordered by the time they were recorded. For images, the values in the tensor are pixels ordered in a grid corresponding with the width and height of the image.

In this chapter, the focus is on working with CNNs and images in TensorFlow. The goal is to build a CNN model using TensorFlow that categorizes images based on a subset of ImageNet's database. Training a CNN model will require working with images in TensorFlow and understanding how convolutional neural networks are used. The majority of the chapter is dedicated to introducing concepts of computer vision using TensorFlow.

The dataset used in training this CNN model is a subset of the images available in ImageNet named the Stanford Dogs Dataset. As the name implies, this dataset is filled with images of different dog breeds, each paired with a label of the breed shown in the image. The goal of the model is to take an image and accurately guess the breed of dog shown in it.

Example images tagged as "Siberian Husky" from the Stanford Dogs Dataset.


If one of the images shown above is loaded into the model, it should output a label of Siberian Husky. These example images wouldn't be a fair test of the model's accuracy because they exist in the training dataset. Finding a fair metric to calculate the model's accuracy requires a large number of images which won't be used in training. The images which haven't been used in training the model will be used to create a separate test dataset.

The reason to bring up the fairness of an image as a test of a model's accuracy is that it's part of keeping separate test, train and cross-validation datasets. While processing input, it is a required practice to separate the data used to train a network from the data used to test it. This separation allows a blind test of the model. Testing a model with input which was used to train it will likely create a model which accurately matches input it has already seen while not being capable of working with new input. The testing dataset is then used to see how well the model performs with data which didn't exist in the training. Over time and iterations of the model, it is possible that the changes being made to increase accuracy are making the model better fitted to the testing dataset while performing poorly in the real world. A good practice is to use a cross-validation dataset to check the final model and receive a better estimate of its accuracy. With images, it's best to separate the raw dataset before doing any preprocessing (color adjustments or cropping), keeping the input pipeline the same across all the datasets.
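
As a loose illustration of this practice, the following sketch (using hypothetical filenames rather than the chapter's dataset) shuffles a list of image files once and carves it into the three datasets.

# A hedged sketch with made-up filenames: shuffle once, then slice the
# list into train, test and cross-validation datasets.
import numpy as np

filenames = np.array(["img_%d.jpg" % i for i in range(100)])  # hypothetical files
np.random.seed(0)            # fixed seed so the split is reproducible
np.random.shuffle(filenames)

train = filenames[:80]       # 80% used to fit the model
test = filenames[80:90]      # 10% held out to measure accuracy
validation = filenames[90:]  # 10% reserved for the final cross-validation check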


In [1]:
# setup-only-ignore
import tensorflow as tf
import numpy as np

In [2]:
# setup-only-ignore
sess = tf.InteractiveSession()

Convolutional Neural Networks

Technically, a convolutional neural network is a neural network that has at least one layer (tf.nn.conv2d) performing a convolution between its input \(f\) and a configurable kernel \(g\) to generate the layer's output. In a simplified definition, a convolution's goal is to apply a kernel (filter) to every point in a tensor, generating a filtered output by sliding the kernel over the input tensor.
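
As a quick illustration, here is a minimal sketch (not the chapter's architecture) that slides a two-element kernel across a one-pixel-tall image; each output value is the difference between a pair of neighboring input values.

# A single convolution: the 1x2 kernel [1, -1] is slid across the width
# of a [batch, height, width, channels] = [1, 1, 4, 1] input tensor.
input_tensor = tf.constant([[[[1.0], [2.0], [3.0], [4.0]]]])
kernel = tf.constant([[[[1.0]], [[-1.0]]]])   # shape [1, 2, 1, 1]
conv = tf.nn.conv2d(input_tensor, kernel, strides=[1, 1, 1, 1], padding='VALID')
sess.run(conv)   # [[[[-1.], [-1.], [-1.]]]], one value per kernel position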

An example of the filtered output is edge detection in images. A special kernel is applied to each pixel of an image and the output is a new image depicting all the edges. In this case, the input tensor is an image and each point in the tensor is treated as a pixel which includes the amount of red, green and blue found at that point. The kernel is slid over every pixel in the image and the output value increases whenever there is an edge between colors.
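
To make the edge detection case concrete, here is a minimal sketch with a tiny made-up grayscale image (bright on top, dark on the bottom) and a Sobel-style kernel that responds to horizontal edges.

# A Sobel-style kernel applied to a fake 5x3 grayscale image; the output
# is largest where the bright rows meet the dark rows.
image = tf.reshape(tf.constant([
    [1.0, 1.0, 1.0],
    [1.0, 1.0, 1.0],
    [0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0],
]), [1, 5, 3, 1])                 # [batch, height, width, channels]
kernel = tf.reshape(tf.constant([
    [ 1.0,  2.0,  1.0],
    [ 0.0,  0.0,  0.0],
    [-1.0, -2.0, -1.0],
]), [3, 3, 1, 1])
edges = tf.nn.conv2d(image, kernel, strides=[1, 1, 1, 1], padding='VALID')
sess.run(edges)   # windows overlapping the boundary produce large values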

This shows a simplified convolution layer where the input is an image and the output is all the horizontal lines found in the image.


It isn't important to understand how convolutions combine input to generate filtered output, or what a kernel is, until later in this chapter when they're put into practice. Obtaining a broad sense of what a CNN does and its biological inspiration builds the groundwork for the technical implementation.

In 1968, Hubel and Wiesel published an article detailing new findings on the cellular layout of the monkey striate cortex (the section of the brain thought to process visual input). The article discusses a grouping of cells which extend vertically, combining together to match certain visual traits. The study of primate brains may seem irrelevant to machine learning tasks, yet it was instrumental in the development of deep learning using CNNs.

CNNs follow a simplified process for matching information, resembling the structure found in the cellular layout of a monkey's striate cortex. As signals are passed through a monkey's striate cortex, certain layers will signal when a visual pattern is highlighted. For example, one layer of cells will activate (increase its output signal) when a horizontal line passes through it. A CNN will exhibit similar behavior, where clusters of neurons will activate based on patterns learned from training. For example, after training, a CNN will have certain layers which activate when a horizontal line passes through them.

Matching horizontal lines alone would be a useful neural network capability, but CNNs take it further by layering multiple simple patterns to match complex patterns. In the context of CNNs, these patterns are known as filters or kernels, and the goal is to adjust these kernel weights until they accurately match the training data. Training these filters is often accomplished by combining multiple different layers and learning weights using gradient descent.
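
To give a sense of what adjusting kernel weights looks like in code, here is a minimal, hypothetical sketch: the kernel is a variable, and gradient descent nudges its weights to reduce a made-up loss. The input and target below are fabricated for illustration; a real model computes its loss from labeled training data.

# A hedged sketch of learning kernel weights with gradient descent.
image = tf.constant(np.random.rand(1, 5, 5, 1), dtype=tf.float32)  # fake input
target = tf.zeros([1, 3, 3, 1])                                    # fake target
kernel = tf.Variable(tf.truncated_normal([3, 3, 1, 1]))
conv = tf.nn.conv2d(image, kernel, strides=[1, 1, 1, 1], padding='VALID')
loss = tf.reduce_mean(tf.square(conv - target))
train_op = tf.train.GradientDescentOptimizer(0.01).minimize(loss)

sess.run(tf.global_variables_initializer())
for _ in range(10):
    sess.run(train_op)   # each step moves the kernel weights downhill on the loss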

A simple CNN architecture may combine a convolutional layer (tf.nn.conv2d), non-linearity layer (tf.nn.relu), pooling layer (tf.nn.max_pool) and a fully connected layer (tf.matmul). Without these layers, it's difficult to match complex patterns because the network will be filled with too much information. A well designed CNN architecture highlights important information while ignoring noise. We'll go into details on how these layers work together later in this chapter.
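
A minimal sketch of that layer ordering follows, using made-up shapes; it only shows how the four layer types connect, not the architecture built later for the dogs dataset.

# Convolution -> non-linearity -> pooling -> fully connected, with toy shapes.
input_layer = tf.constant(np.random.rand(1, 8, 8, 1), dtype=tf.float32)
kernel = tf.Variable(tf.truncated_normal([3, 3, 1, 4]))
conv = tf.nn.conv2d(input_layer, kernel, strides=[1, 1, 1, 1], padding='SAME')
relu = tf.nn.relu(conv)                             # non-linearity layer
pool = tf.nn.max_pool(relu, ksize=[1, 2, 2, 1],
                      strides=[1, 2, 2, 1], padding='SAME')
flattened = tf.reshape(pool, [1, 4 * 4 * 4])        # flatten for the matmul
weights = tf.Variable(tf.truncated_normal([4 * 4 * 4, 10]))
fully_connected = tf.matmul(flattened, weights)     # fully connected layer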

The input image for this architecture is in a complex format designed to support the ability to load batches of images. Loading a batch of images allows the computation of multiple images simultaneously, but it requires a more complex data structure. The data structure used is a rank-four tensor including all the information required to convolve a batch of images. TensorFlow's input pipeline (which is used to read and decode files) has a special format designed to work with multiple images in a batch, including the required information for each image ([image_batch_size, image_height, image_width, image_channels]). Using the example code, it's possible to examine the structure of an example input used while working with images in TensorFlow.


In [3]:
image_batch = tf.constant([
        [  # First Image
            [[0, 255, 0], [0, 255, 0], [0, 255, 0]],
            [[0, 255, 0], [0, 255, 0], [0, 255, 0]]
        ],
        [  # Second Image
            [[0, 0, 255], [0, 0, 255], [0, 0, 255]],
            [[0, 0, 255], [0, 0, 255], [0, 0, 255]]
        ]
    ])
image_batch.get_shape()


Out[3]:
TensorShape([Dimension(2), Dimension(2), Dimension(3), Dimension(3)])

NOTE: The example code and further examples in this chapter do not include the common bootstrapping required to run TensorFlow code. This includes importing the tensorflow library (usually as tf for brevity), creating a TensorFlow session as sess, initializing all variables and starting thread runners. Undefined variable errors may occur if the example code is executed without running these steps.

In this example code, a batch of images is created which includes two images. Each image has a height of two pixels and a width of three pixels with an RGB color space. The output from executing the example code shows the number of images as the first dimension Dimension(2), the height of each image as the second dimension Dimension(2), the width of each image as the third dimension Dimension(3) and the size of the color channel as the final dimension Dimension(3).

It's important to note that each pixel maps to the height and width of the image. Retrieving the first pixel of the first image requires accessing each dimension as follows.


In [4]:
sess.run(image_batch)[0][0][0]


Out[4]:
array([  0, 255,   0], dtype=int32)

Instead of loading images from disk, the image_batch variable will act as if it were images loaded as part of an input pipeline. Images loaded from disk using an input pipeline have the same format and act the same. It's often useful to create fake data similar to the image_batch example above to test the input and output of a CNN. The simplified input will make it easier to debug any simple issues. Simplifying the debugging process is important because CNN architectures are incredibly complex and often take days to train.
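
As an example of that debugging pattern, the following hedged sketch feeds the fake image_batch through a single convolution layer and checks that only the expected dimensions change.

# Sanity-check a layer with the fake batch: cast to float (tf.nn.conv2d
# expects floating point input), convolve, and inspect the output shape.
float_batch = tf.to_float(image_batch)
kernel = tf.Variable(tf.truncated_normal([1, 1, 3, 2]))
conv = tf.nn.conv2d(float_batch, kernel, strides=[1, 1, 1, 1], padding='SAME')
sess.run(tf.global_variables_initializer())
sess.run(conv).shape   # (2, 2, 3, 2): batch, height, width preserved; 2 channels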

The first complexity in working with CNN architectures is understanding how a convolution layer works. After any image loading and manipulation, a convolution layer is often the first layer in the network. The first convolution layer is useful because it can simplify the rest of the network and be used for debugging. The next section will focus on how convolution layers operate and on using them with TensorFlow.